Kaggle in Practice: Titanic

Competition: Titanic
Reference: An Interactive Data Science Tutorial
The Titanic survival prediction competition is a binary classification problem: given each passenger's attributes, predict whether they survived the sinking.

First, import the necessary libraries:

# Ignore warnings
import warnings
warnings.filterwarnings('ignore')

# Handle table-like data and matrices
import numpy as np
import pandas as pd

# Modelling Algorithms
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

# Modelling Helpers
from sklearn.impute import SimpleImputer  # sklearn.preprocessing.Imputer was removed in scikit-learn 0.22
from sklearn.preprocessing import Normalizer, scale
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.feature_selection import RFECV

# Visualisation
import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.pylab as pylab
import seaborn as sns

# Configure visualisations
%matplotlib inline
mpl.style.use('ggplot')
sns.set_style('white')
pylab.rcParams['figure.figsize'] = 8, 6

Next, define some helper functions for plotting:

def plot_histograms(df, variables, n_rows, n_cols):
    fig = plt.figure(figsize=(16, 12))
    for i, var_name in enumerate(variables):
        ax = fig.add_subplot(n_rows, n_cols, i + 1)
        df[var_name].hist(bins=10, ax=ax)
        ax.set_title('Skew: ' + str(round(float(df[var_name].skew()))))
        ax.set_xticklabels([], visible=False)
        ax.set_yticklabels([], visible=False)
    fig.tight_layout()  # improves appearance a bit
    plt.show()

def plot_distribution(df, var, target, **kwargs):
    row = kwargs.get('row', None)
    col = kwargs.get('col', None)
    facet = sns.FacetGrid(df, hue=target, aspect=4, row=row, col=col)
    facet.map(sns.kdeplot, var, fill=True)  # 'shade=' was renamed to 'fill=' in newer seaborn
    facet.set(xlim=(0, df[var].max()))
    facet.add_legend()

def plot_categories(df, cat, target, **kwargs):
    row = kwargs.get('row', None)
    col = kwargs.get('col', None)
    facet = sns.FacetGrid(df, row=row, col=col)
    facet.map(sns.barplot, cat, target)
    facet.add_legend()

def plot_correlation_map(df):
    corr = df.corr(numeric_only=True)  # the original referenced the global 'titanic' here; use the argument instead
    _, ax = plt.subplots(figsize=(12, 10))
    cmap = sns.diverging_palette(220, 10, as_cmap=True)
    _ = sns.heatmap(
        corr,
        cmap=cmap,
        square=True,
        cbar_kws={'shrink': .9},
        ax=ax,
        annot=True,
        annot_kws={'fontsize': 12}
    )

def describe_more(df):
    var, l, t = [], [], []
    for x in df:
        var.append(x)
        l.append(df[x].nunique())  # pd.value_counts() is deprecated; nunique() gives the same count
        t.append(df[x].dtypes)
    levels = pd.DataFrame({'Variable': var, 'Levels': l, 'Datatype': t})
    levels.sort_values(by='Levels', inplace=True)
    return levels

def plot_variable_importance(X, y):
    tree = DecisionTreeClassifier(random_state=99)
    tree.fit(X, y)
    plot_model_var_imp(tree, X, y)

def plot_model_var_imp(model, X, y):
    imp = pd.DataFrame(
        model.feature_importances_,
        columns=['Importance'],
        index=X.columns
    )
    imp = imp.sort_values(['Importance'], ascending=True)
    imp[-10:].plot(kind='barh')  # with the ascending sort, the last ten rows are the most important features
    print(model.score(X, y))

Training and Test Sets

Next, load the training and test sets and concatenate them, so that the later data analysis and feature engineering can operate on both at once:

train_data = pd.read_csv('train.csv')
test_data = pd.read_csv('test.csv')
full = pd.concat([train_data, test_data], ignore_index=True)  # DataFrame.append was removed in pandas 2.x
titanic = full[:891].copy()  # full is the combined dataset, titanic the training part; .copy() avoids SettingWithCopyWarning later
print('full:', full.shape, ';titanic:', titanic.shape)

Output:

full: (1309, 12) ;titanic: (891, 12)

Data Analysis

Use full.head() to preview the first few rows of the data, and titanic.info() / test_data.info() to list each column's dtype and non-null count for the training and test sets:
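
full.head()        # preview the first rows of the combined dataset
titanic.info()     # dtypes and non-null counts for the training set
test_data.info()   # the same summary for the test set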

The columns are as follows:
Age: passenger age; a moderate number of values are missing
Cabin: cabin number; missing for most passengers
Embarked: port of embarkation (C, Q, or S); only 2 values missing, both in the training set
Fare: ticket fare; 1 value missing, in the test set
Name: passenger name
Parch: number of parents and children aboard
PassengerId: an auto-incrementing ID with no predictive value
Pclass: ticket class, one of 1, 2, 3
Sex: male or female
SibSp: number of siblings and spouses aboard
Survived: the label; 0 = No, 1 = Yes
Ticket: ticket number
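
These missing-value counts are easy to confirm directly (a quick check, not part of the original post):

full.isnull().sum()  # missing values per column; Survived shows 418 because the test rows are unlabelled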

Plotting a correlation heat map may reveal which variables matter:

plot_correlation_map(titanic)

Output (figure): an annotated correlation heat map of the numeric columns.

Next, plot how individual features relate to survival.
First, Age and Sex versus Survived:

plot_distribution(titanic, var='Age', target='Survived', row='Sex')

Output (figure): KDE curves of Age for each Survived class, one facet row per Sex.

Regions where the two curves differ most are where the feature discriminates best: young males survived noticeably more often, while middle-aged males more often died.

Next, Fare versus Survived:

plot_distribution(titanic, var='Fare', target='Survived')

Output (figure): KDE curves of Fare for each Survived class.

Low fares clearly come with a higher death rate.

Next, Embarked versus Survived:

print(titanic.Embarked.value_counts())
plot_categories(titanic, cat='Embarked', target='Survived')

Output:

S 644
C 168
Q 77
Name: Embarked, dtype: int64

S has by far the most passengers yet the lowest survival rate.
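
The survival rates behind these bar plots can also be read off numerically with a groupby (a quick check, not part of the original post); the same one-liner works for Sex, Pclass, and the other categorical features below:

titanic.groupby('Embarked')['Survived'].mean()  # survival rate per port of embarkation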

Then Sex versus Survived:

print(titanic.Sex.value_counts())
plot_categories(titanic, cat='Sex', target='Survived')

Output:

male 577
female 314
Name: Sex, dtype: int64

Women are far fewer in number but survived at a much higher rate.

And Pclass versus Survived:

print(titanic.Pclass.value_counts())
plot_categories(titanic, cat='Pclass', target='Survived')

Output:

3 491
1 216
2 184
Name: Pclass, dtype: int64

Class 1 has the fewest passengers but the highest survival rate; class 3 has the most passengers and the lowest.

For SibSp and Parch, sum the two and binarize the result into zero versus non-zero:

titanic['Family_All'] = titanic['SibSp'] + titanic['Parch']
titanic['Family_All'] = [0 if i == 0 else 1 for i in titanic.Family_All]
print(titanic.Family_All.value_counts())
plot_categories(titanic, cat='Family_All', target='Survived')

Output:

0 537
1 354
Name: Family_All, dtype: int64

Passengers with a value of 0 (travelling alone) survived at a much lower rate than those with family aboard.

With this visual analysis done, the raw data can now be processed.

First, map Sex from male/female to 1/0:

my_sex = pd.DataFrame()
my_sex['Sex'] = [1 if i == 'male' else 0 for i in full.Sex]

Embarked has only two missing values; fill them with the most frequent port, 'S', then one-hot encode the categories with pd.get_dummies:

my_embarked = pd.DataFrame()
my_embarked['Embarked'] = full.Embarked.fillna('S')
my_embarked = pd.get_dummies(my_embarked.Embarked, prefix = 'Embarked')
my_embarked.head()

Output: the first rows of the indicator columns Embarked_C, Embarked_Q, and Embarked_S.

Pclass has no missing values and only needs one-hot encoding:

my_pclass = pd.DataFrame()
my_pclass = pd.get_dummies(full.Pclass, prefix='Pclass')

Fare has a single missing value, in the test set; fill it with the mean fare, then bin the fares into quartiles with pd.qcut and one-hot encode:

my_fare = pd.DataFrame()
my_fare['Fare'] = full.Fare.fillna(full.Fare.mean())
my_fare['Fare'] = pd.qcut(my_fare['Fare'], 4)
my_fare = pd.get_dummies(my_fare.Fare, prefix='Fare')
my_fare.head()

Age has many missing values. Fill them with random integers drawn from [mean − std, mean + std] of the observed ages, then bin into quartiles and one-hot encode:

my_age = pd.DataFrame()
my_age['Age'] = full.Age
age_avg = full.Age.mean()
age_std = full.Age.std()
age_null_count = full.Age.isnull().sum()
age_null_random_list = np.random.randint(age_avg - age_std, age_avg + age_std, size=age_null_count)
my_age.loc[my_age['Age'].isnull(), 'Age'] = age_null_random_list  # .loc avoids the chained-assignment pitfall
my_age['Age'] = pd.qcut(my_age['Age'], 4)
my_age = pd.get_dummies(my_age.Age, prefix='Age')
my_age.head()
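
Because this imputation is random, the quartile bins (and the final score) can shift between runs; seeding NumPy beforehand makes the result reproducible (a suggestion, not part of the original post):

np.random.seed(0)  # any fixed seed makes the random age imputation repeatable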

A Title feature can be extracted from Name and one-hot encoded:

title = pd.DataFrame()
title['Title'] = full['Name'].map(lambda name: name.split(',')[1].split('.')[0].strip())
Title_Dictionary = {
"Capt": "Officer",
"Col": "Officer",
"Major": "Officer",
"Jonkheer": "Royalty",
"Don": "Royalty",
"Sir" : "Royalty",
"Dr": "Officer",
"Rev": "Officer",
"the Countess":"Royalty",
"Dona": "Royalty",
"Mme": "Mrs",
"Mlle": "Miss",
"Ms": "Mrs",
"Mr" : "Mr",
"Mrs" : "Mrs",
"Miss" : "Miss",
"Master" : "Master",
"Lady" : "Royalty"
}
title['Title'] = title.Title.map(Title_Dictionary)
title = pd.get_dummies(title.Title)
title.head()

Output: the first rows of the one-hot columns Master, Miss, Mr, Mrs, Officer, and Royalty.
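
One check worth adding (my addition, not in the original post): any raw title absent from Title_Dictionary maps to NaN and is silently dropped by get_dummies, so it pays to verify the dictionary covers every title in the data:

raw_titles = full['Name'].map(lambda name: name.split(',')[1].split('.')[0].strip())
print(raw_titles[~raw_titles.isin(Title_Dictionary)].unique())  # should print an empty array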

Parch and SibSp are combined into Family_All, just as during the analysis:

my_family = pd.DataFrame()
my_family['Family_All'] = full['Parch'] + full['SibSp']
my_family['Family_All'] = [0 if i == 0 else 1 for i in my_family.Family_All]

Cabin is dropped for now because most of its values are missing, and Ticket is dropped as well.
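
Even a mostly-missing column can carry some signal, though: whether a passenger has a recorded cabin at all correlates with class and survival. A minimal sketch of such a feature (my own suggestion, not used in this post):

my_cabin = pd.DataFrame()
my_cabin['Has_Cabin'] = full.Cabin.notnull().astype(int)  # 1 if a cabin number was recorded, 0 otherwise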

Training the Model

Concatenate the engineered features, then split them back into training and test sets:

full_X = pd.concat([my_family, title, my_age, my_embarked, my_fare, my_pclass, my_sex], axis=1)
train_X = full_X[0:891]
train_y = titanic.Survived
test_X = full_X[891:]
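
A quick shape check confirms the split: 891 labelled training rows and 418 unlabelled test rows, with the same feature columns in both:

print(train_X.shape, train_y.shape, test_X.shape)  # expect 891 training rows and 418 test rows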

Choose a model and evaluate it with 5-fold cross-validation:

from sklearn.model_selection import cross_val_score
model = GradientBoostingClassifier(learning_rate=0.01, max_depth=3, n_estimators=150)
# model = SVC()
# model = RandomForestClassifier(n_estimators=100)
# model = DecisionTreeClassifier()
cross_val_score(model, train_X, train_y, cv=5).mean()

Output:

0.8215596071618176
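
The hyperparameters above were set by hand; a small grid search is the natural next step (a sketch, not part of the original post; the parameter grid is an arbitrary example):

from sklearn.model_selection import GridSearchCV
param_grid = {
    'learning_rate': [0.01, 0.05, 0.1],
    'n_estimators': [100, 150, 300],
    'max_depth': [2, 3, 4],
}
search = GridSearchCV(GradientBoostingClassifier(), param_grid, cv=5)
search.fit(train_X, train_y)
print(search.best_params_, search.best_score_)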

Finally, fit the model on the full training set and predict on the test set:

model.fit(train_X, train_y)
test_Y = model.predict(test_X).astype(int)  # Survived became float64 in the merged frame; Kaggle expects integer 0/1
passenger_id = full[891:].PassengerId
test = pd.DataFrame({'PassengerId': passenger_id, 'Survived': test_Y})
test.to_csv('titanic_pred.csv', index=False)

The submission scored 0.78947 on Kaggle, landing in the top 32%.

------------- End of post. Thanks for reading! -------------